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Abstract. In this paper we describe the current status of the GRAPE-6 
project to develop a special-purpose computer with a peak speed exceed- 
ing 100 Tflops for the simulation of astrophysical A-body problems. One 
of the main targets of the GRAPE-6 project is the simulation of dense 
stellar systems. In this paper, therefore, we overview the basic algorithms 
J> ■ we use for the simulation of dense stellar systems and their characteris- 

tics. We then describe how we designed GRAPE hardwares to meet the 
requirements of these algorithms. GRAPE-6 will be completed by the 
year 2001. As an example of what science can be done on GRAPE-6, 
• we describe our work on the galactic center with massive black holes 

<^ , performed on GRAPE-4, the predecessor of GRAPE-6. 
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Direct A-body simulation of star clusters or other stellar systems has proven 
itself an extremely powerful tool to study the structure and evolution of stel- 
lar systems, since the pioneering work by von Horner (1960, 1963| ). The only 



known way to do experiments on stellar systems is to construct their models in 
computers, because we cannot do laboratory experiments on stellar systems, 
Of course, A-body simulation is not the only way to construct computer 



models of star clusters. One could use Monte-Carlo (Henon 1971, for recent 



development see Giersz 1998 ) or direct integration of Fokker-Planck (FP) equa- 



tions (Cohn 198C, Takahashi 1996, Drukier et al. |1999| ). In particular, recent 



advance in the treatment of the two-dimensional [f(E, J)] FP equation made it 
possible to use FP code to the study of the evaporation of clusters in the tidal 
field of the parent galaxy. 

The main advantage of these methods over direct A-body simulation is the 
calculation cost. While it is still out of reach to perform the direct simula- 
tion of, say, a typical globular cluster with A = 10 6 , the FP equation is valid 
in the limit of A — > oo. Therefore, when the assumption of large- A limit is 
good, methods based on the FP approximation can deliver reliable results for 
the calculation cost orders of magnitudes smaller than that of direct A-body 
calculation. In principle, one can do the A-body simulation with small A and 
then try to extrapolate that result to larger A. However, this problem, which is 
sometimes called "the scaling problem" has proven itself a complex and difficult 
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problem(Heggie et al. 1998). The main reason of the difficulty is that there 



are many physical processes of which timescales depend on the number of parti- 
cles in different ways, and that there is no way to adjust all relevant timescales 
consistently. 

On the other hand, methods based on FP approximation do have their own 
limitations. For example, it's pretty difficult to extend the formulation beyond 
the spherical symmetry. Moreover, they also suffer the problem of the time 
scaling, only difference is that they are affected from the opposite direction. For 
example, when we want to include the dynamical effect of the slowly varying 
potential, FP approximation goes into trouble. Consider the tidal stripping of 
globular clusters with non-circular orbit. The orbital timescale is of the order of 
10 8 years. The orbital timescale of the stars around the tidal boundary of the 
cluster is also 10 8 years. On the other hand, the half-mass relaxation time is 
of the order of 10 9 years. Therefore, the timestep to integrate the FP equation 
is at the largest 10 8 years, and in practice much smaller than that. Clearly, 
the assumption that the dynamical timescale is smaller than the Fokker-Plank 
timestep is broken, and FP calculations tends to grossly overestimate the escaper 
rate. 

As beautifully demonstrated by Takahashi and Portegies Zwart ( 1998| ), one 
can incorporate the escaping timescale into FP calculation. However, in order 
to do so the model parameter must be adjusted so that the agreement with 
reference A^-body calculations is achieved. Moreover, this agreement has been 
so far achieved only with iV-body simulations with simple static "tidal cutoff" , 
where no tidal field is imposed and stars outside the tidal radius are just removed. 
Whether or not FP calculations can reproduce the result of iV-body calculations 
with tidal fields is an open question. 

The ultimate solution for the scaling problem is to perform TV-body simu- 
lations with sufficiently large number of particles. In the first half of this paper, 
we describe our effort in that direction, the GRAPE project to develop and use 
special-purpose computers for iV-body simulations. In the last half of this paper, 
we'd like to discuss briefly our recent work on iV-body simulations of galactic 
centers with massive black holes, where again the scaling problem shows up. 



2. Direct simulation of dense stellar systems 

In principle, the direct TV-body simulation of stellar systems is very simple and 
straightforward. The only thing one has to do is to numerically integrate the 
following system of equations of motion: 

3+^ 1 J 1 

where Xj and mi are the position and mass of the particle with index i and G is 
the gravitational constant. The summation is taken over all stars in the system. 

In practice, we need complex methods and tricks to integrate the above 
equation. Since we cannot cover all important issues here, we recommend the 
reviews by Spurzem ( 11999ft and Aarseth ( [19994 |1999b| ). Here, we briefly dis- 



cuss issues directly related to the use of special-purpose hardwares for iV-body 
problem. 
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As we wrote above, in principle an iV-body simulation is simple and straight- 
forward. At each timestep, we calculate the forces on all particles in the system, 
and integrate their orbits using some appropriate integration method. In fact, in 
Molecular Dynamics simulation, where one solves the iV-body problem of atoms 
interacting through Coulomb and van der Waals forces, one can rely on this 
approach. In astrophysics, however, the nature of the gravitational interaction 
makes such approach impractical. 

The problem is that the gravitational force is an attractive force with no 
characteristic scale length. This fact leads to three complications. The first 
one is that the inhomogeneity develops as the system evolves. In the case of 
star clusters, this is known as core collapse or gravitational catastrophe. The 
central core of the cluster evolves to higher and higher density, while its mass 
decreases. The timestep has to be small enough to integrate accurately the 
orbits of particles in the core. Therefore, the timestep decreases as the cluster 
evolves. The second one is that even when the core density is not very high, 
random close encounters of two particles can lead to arbitrary short timesteps. 
The third one is that stable binaries and more complex hierarchical systems 
are formed through three-body encounters and encounters of larger number of 
particles (such as binary-binary interactions). These systems have very small 
orbital timescales. For a quantitative analysis of these issues, see Makino and 
Hut (HH, |i~990D . 

Fortunately, we can handle these complications by a combination of tech- 
niques. The first two are solved by assigning each particle its own time and 
timestep. This scheme, the individual timestep scheme, was first introduced by 
Aarseth ( |1963| ), and has been used for four decades. 

In the individual timestep scheme, each particle has its own timestep Ati 
and maintains its own time ij. To integrate the system, one first selects the 
particle for which the next time (ij + Ati) is the minimum. Then, one predicts 
its position at this new time. Positions of all other particles at this time must be 
predicted also. Then the force on that particle from other particles is calculated. 
The position and velocity of the particle is then corrected. The new timestep is 
also c alculated. The integration scheme is a variation of Krogh's scheme (Krogh 
1974 ) modified for second-order equations, f 



A modification of this individual timestep algorithm is now used to achieve 
higher efficiency on vec tor machines (McMillan |1986|) and special-purpose com- 
puters (Makino 1991a ). This scheme is called the blockstep scheme. In this 
scheme, the timesteps of particles are forced to integer powers of two. In addi- 
tion, the new timestep of a particle is chosen so that the time of the particle is an 
integer multiple of the new timestep. These two criteria make it possible to force 
multiple particles to share exactly the same time. As a result, the efficiency of 
vector /parallel hardware is improved significantly. However, it should be noted 
that the average number of particles which share the same time is not very large, 
in particular when the central core is small and dense. Therefore, it is necessary 
that the force calculation procedure can achieve reasonable performance when 



1 Though this scheme is usually c alled Krogh's scheme in the field of numerical analysis, to our 
knowledge the work by Wielen (1967) is the first to discuss the implementation of individual 
timestep with an arbitrary order integration scheme. 
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Figure 1. Basic idea of GRAPE 
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the number of particles to be integrated in parallel is small. Since the number 
of particles in the core can be as small as 100 or less, it is necessary that the 
force calculation procedure can achieve a reasonable speed for that number. 

The third problem, the stable binaries and other hierarchical systems, is 
solved essentially by integrating them as small- N subsystem embedded in the 
cluster. The motion of the center of mass of the subsystem and internal mo- 
tion of its members are separated, and integrated independently with different 
timesteps. If the perturbations from the rest of the system is small, we can 
ignore it. In the case of a binary, this means we can completely skip the numer- 
ical integration of the internal motion, since we know the analytic solution. If 
the perturbation is not negligible but is small, we can apply various techniques 
such as "Slow KS" (Mikkola and Aarseth 1996| ). Here, the hardest problem is 
not really the calculation cost. The challenge is how we can recognize these 
subsystems and their internal structures, and how we can decide what method 
is appropriate for that particular structure. 



3. GRAPE for the simulation of dense stellar systems 

We have developed a series of special-purpose computers for A-body simulations, 
which we call GRAPE (GRAvity PipE). Figure |] gives the basic idea. The 
host computer, which is usually a general-purpose workstation running UNIX, 
send the positions and masses of particles to the GRAPE hardware. Then 
GRAPE hardware calculates the interaction between particles. What GRAPE 
hardware calculates is the right hand side of equation ([l]). When we use the 
Hermite scheme (Makino 1991b| ) , the first time derivative of the force must also 



be calculated. In this case, velocities of particles are needed. Figure g shows the 
block diagram of the pipeline unit to calculate the force and its time derivative. 

In order to combine the individual timestep scheme with GRAPE hard- 
ware, one modification of the basic architecture is necessary. Figure || shows the 
change. As described in the previous section, we have to predict the position 
(and velocity in the case of the Hermite scheme) of all particles to calculate the 
forces on the particles in the current blockstep. This prediction must also be 
done on the GRAPE hardware, since otherwize the amount of the calculation 
the host computer has to do becomes too large. 

In the modified architecture shown in figure |3|, the particle memory keeps all 
data necessary for prediction, for all particles in the system. At each blockstep, 
the host computer writes the position and velocity of particles to be updated, 
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Figure 2. Pipelined processor for the calculatio n of f orce and its time 
derivative. Reproduced from Makino et al. 1997 1997 ), 
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Figure 3. GRAPE for individual timestep 



and GRAPE calculates the forces on them and sends the results back to the 
host. If the number of pipelines is smaller than the number of particles in the 
block, this step is repeated until the forces on all particles in the block are 
obtained. Then the host performs the orbit integration using these calculated 
forces, and updates the data of the integrated particles in the particle memory 
of the GRAPE hardware. 



GRAPE-4 (Makino et al. |1997|) is the first GRAPE hardware to implement 
this modified architecture. In GRAPE-4, one processor board houses one pre- 
dictor units and 96 (virtual) force calculation pipelines. The total system in the 
maximum configuration consisted of 36 boards organized into four clusters, and 
different boards calculated the forces on the same set of 96 particles. In this 
way, we met the requirement that the number of forces calculated in parallel is 
small, even though the number of pipelines is large. 

Summation of 9 forces from processor boards in the same cluster is taken 
care by the communication hardware, and final summation of the forces from 
four clusters is handled by the host. 

GRAPE-4 was completed in 1995, and has been used by many researchers 
for the study of dense stellar systems. 



6 



Makino 




4. GRAPE-6 

In 1997, we started the GRAPE-6 project. It's a five-year project funded by 
JSPS (Japan Society for the Promotion of Science), and the planned total budget 
is about 500 M J YE. 

The GRAPE-6 is essentially a scaled-up version of GRAPE-4(Makino et al. 

1997|) , with the peak speed exceeding 100 Tflops. It will consist of around 3000 
pipeline chips, each with the peak speed of 40 Gflops. In comparison, GRAPE-4 
consists of 1700 pipeline chips, each with 600 Mflops. The increase of a factor 
of 100 in speed is achieved by integrating six pipelines into one chip (GRAPE-4 
chip has one pipeline which needs three cycles to calculate the force from one 
particle) and using 3-4 times higher clock frequency. The advance of the device 
technology (from 1/xm to 0.25//m) made these improvements possible. Figure 
^ shows the first sample of the processor chip delivered in early 1999. The six 
pipeline units are visible. 

Figures [| and || shows the processor board with 16 processor chips and the 
prototype four-board system. This four-board system has the theoretical peak 
speed of 2.1 Tflops, and has achieved the sustained speed of 1.1 Tflops for the 
simulation of 1 million-body system. 

GRAPE-6 will be completed by the year 2001. We plan to make small 
version of GRAPE-6 (peak speed of around one teraflops) commercially available 
by that time. We've found that the commercial availability of small machines is 
essential to maximize the scientific outcome from GRAPE hardwares. 

Compared to GRAPE-4, GRAPE-6 will give us 100 times more computer 
power. For simulation of star clusters for the relaxation timescale, this means 
a factor of five increase in the number of particles we can handle. For short 
simulations, the increase would be a factor of 10. In the case where we can 
use tree algorithms, in principle a factor of 100 increase is possible if the host 
computer has a sufficiently large memory. Table 1 gives a rough idea of what is 
currently possible with GRAPE-4 and what will be possible soon with GRAPE- 
6. 
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Figure 5. The processor board of the GRAPE-6 with 16 processor 
chips. Two processor chips are mounted on modules, on which four 
memory chips are also mounted. One board houses eight modules. 




Figure 6. The prototype system with four processor board. 
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Table 1. Particle Number of Simulation Feasible on GRAPE-4 and 6 



Problem Aria 


GRAPE-4 


GRAPE 


Planet Formation 


5 x 10 4 


10 b 


Star Cluster 


5xl0 4 


3xl0 5 


Black Hole Binary in Galactic Nucleus 


10 6 


10 7 


Galaxy Evolution & Interactions 


2xl0 6 


10 8 


Large Scale Structures 


3xl0 7 


3xl0 9 



5. BH binaries in galactic cores 

In this section, we briefly describe our resent work on the massive central black 



holes in the centers of galaxies (Makino and Ebisuzaki 1996 , Makino 1997 



Nakano and Makino |19994 |1999b|) . The probl em is what happens when two 



galaxies, each with a central black hole, merge. The merging of two galaxies 
is a rather common event. According to the standard scenario of the struc- 
ture formation in the universe (hierarchical gravitational clustering), galaxies 
are constructed "bottom up" from smaller objects through merging. On the 
other hand, almost all galaxies except for some dwarfs seem to have central 
black holes. Thus, if two such galaxies merge, the merger remnant would have 
two black holes. These black holes would settle in the center of the merger 
remnant, and would eventually form a binary. Two questions arise: (a) What 
is the structure of the central region of the merger galaxy with two black holes? 
and (b) What will happen to the binary itself? Will two black holes eventually 
merge? 



Makino and Ebisuzaki ( 1996| ) studied the first problem, assuming that black 



holes would eventually merge. They performed the simulation of hierarchical 
(repeated) merging similar to those by Farouki et al. ( |1983| ), but with central 
black holes. What Farouki et al. found is that the core radius remains almost 
unchanged after merging, even though the half-mass radius almost doubles af- 
ter each merging event. This result is consistent with the theoretical prediction 
based on the conservation of the central phase space density, but in complete con- 
tradiction with the observations of luminous elliptical galaxies. Ground-based 
observations in 1980s demonstrated that there is strong correlation between the 
effective radius and the "core " radi us of luminous ellipticals (Lauer 1985|) . HST 



observations (Gebhaldt et al. 1996 ) have shown that the "cores" are actually all 
cusps with the profile p = r -0 '"^ , where p is the volume luminosity density. 

Figure [7| show the result of simulations with GRAPE-4. The central region 
of the merger with central black hole looks like the shallow cusps observers found 
in luminous ellipticals, and the radius of this cusp region is a constant fraction 
of the half-mass radius. Thus, numerical simulations of merging of galaxies 
with black holes reproduced the observed characteristic of the central regions of 
luminous ellipticals rather well. 



Nakano and Makino ( |1999a , 1999b| ) investigated why the cusp is formed. 



First, they tried to understand the formation mechanism better by means of 
simplified experiments, in which just one black hole was let to sink to the center 
of a galaxy. Except for the case that the black hole is originally in the center, 
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Figure 7. The density profiles of the mergers with central black holes. 
Profiles are scaled so that the half-mass radii are the same for all rem- 
nants. The panel in the left side shows the results of runs with central 
black holes, and that in the right without black holes. Reproduced 
from Makino and Ebisuzaki (1997). 



the cusp with p = r -0 ' 5 is universally formed, and the mass within the cusp 
region is comparable to that of the mass of the central BH. 

In the second paper (Nakano and Makino 1999b), they came up with a 
simple explanation for the formation mechanism. The black hole sink to the 
center through the dynamical friction, in the dynamical timescale. When the 
black hole settled at the center of the galaxies, almost no star was strongly bound 
to the black hole. This is because the black hole sinks to the center relatively 
slowly. The sinking timescale is not so slow that the adiabatic approximation 
can be applied to the binding energy of the stars, but sufficiently slow that 
the binding energy of most of the stars does not change very much. Thus, the 
distribution function f(E) has a sharp cutoff at the energy close to the central 
potential depth of the initial galaxy model. 

If there is a central massive object and f(E) has a sharp cutoff, we can 
show that the central region is a cusp with the slope —0.5 as follows. 

The density is obtained by integrating the distribution function in the ve- 
locity space as follows: 

p(r) = 4vr [° f(E)J2\E-^(r)\dE, (2) 

Jtp(r) 

where ip(r) is the potential at distance r from the center, where the black hole 
lies. Note that we assumed that the velocity distribution is isotropic. 

We assume that f(E) = if E < —Eq. For any r with ip(r) < Eq, equation 
(|2|) can be rewritten as 

p{r) = 4vr [ f(E)J2\E-i,(r)\dE, (3) 
JEo v 
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which can be expanded as 

p(r) = AV2^M{r)\ [ f{E) 

Therefore, we have 

p{r) oc ^r)\ ~ ( 5 ) 

For the fate of the black hole binary, we performed simulations with very 
wide range of the number of particles (2,048 to 262,144) to investigate the de- 
pendence of the growth timescale of the binding energy of the binary on the 
number of particles in the galaxy. What we found was intriguing. The timescale 
seemed to be proportional to A 1 / 3 , while, theoretically, the timescale must be 
proportional to N, since the timescale should be limited by the timescale of 
filling the loss cone through two-body relaxation. We have not yet understood 
why our numerical experiments did not agree with the theoretical prediction. 
Simulations with larger N, which will be possible with GRAPE-6, will give us 
important clues for the understanding of this problem. 
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6. Discussion 



In this paper, we briefly overviewed the computational issues of direct A-body 
simulations and the use of special-purpose computers, and we reviewed the plan 
and development status of GRAPE-6. We believe GRAPE-6 will give us many 
new important scientific results, much in the same way as GRAPE-4 have done 
so in the five years since its completion. 

Since GRAPE-6 is now close to its completion, it might be worthwhile 
to speculate on what will come next. The silicon device technology has been 
advancing as usual, and will do so at least for one more decade or two. If we will 
start the development of the new system just after the completion of GRAPE-6, 
we will use the technology safely available by 2002, which is either 0.13 or 0.09/im 
technology. This will give us around a factor of six increase in the transistor 
count. For the clock speed, we expect that the improvement by a factor of three 
will not be too hard, mainly because the current clock of GRAPE-6 chip is not 
very high. Thus, we can achieve about a factor of 20 improvement in the speed 
over that of GRAPE-6 chip, reaching around 0.7 Tflops per chip. With three 
thousand chips, the peak speed would reach two Petaflops. 

The target number of particles for such system would be around 1 million, 
for star cluster simulations. For that number, however, one might ask whether 
or not the direct force calculation is really a good approach, even though it 
guarantees very good cost-performance ratio. 

There are two widely used algorithms to reduce the calculation cost. One 
is the neighbor scheme (Ahmad and Cohen 1973] ), and the other is the treecode 
(Barnes and Hut 1986| ). For general-purpose computers, both algorithms give 
pretty large improvement in the overall calculation speed. The theoretical gain 
of the neighbor scheme is 0{N 1 ^ A ) (Makino and Hut 1988| ), and the break-even 
point is as small as N = 25. For the treecode, the gain is 0(N/ log N). The 
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break-even point depends on the required accuracy, but N = 10 6 is certainly 
large enough to guarantee a pretty large gain. 

However, we (or anybody else) have not yet conceived a feasible way to 
combine the neighbor scheme and GRAPE, or treecode with individual timestep 
and GRAPE. In both cases, the problem is that forces on different particles are 
calculated differently. As a result, n pipelines which calculate the forces on 
n different particles need n different data from the particle memory, while in 
the case of the simple direct summation all pipelines can use the same data. 
In principle, if each pipeline has its own memory, we can reconcile massively 
parallel GRAPE and these algorithms. With the next-generation system, it will 
be necessary to integrate the memory and the pipeline processor anyway, since 
otherwize we cannot achieve required memory bandwidth even for the case of 
the simple GRAPE architecture. This integrated memory and pipeline might 
give us the opportunity to implement more efficient force calculation algorithms. 
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